A performance assessment of relatedness inference methods using genome-wide data from thousands of relatives
Abstract
Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these methods in real data has been lacking. Here, we report an assessment of 11 state-of-the-art relatedness inference methods using a dataset with 2,485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (approximately 93%–99%) when reporting first- and second-degree relationships, but their accuracy drops to less than 60% for fifth-degree relationships. However, the inferred relationships were correct to within one relatedness degree at a rate of 83%–99% across all methods and all relationship degrees considered. Furthermore, most methods infer unrelated individuals correctly at a rate of approximately 99%, suggesting a low rate of false positives. Overall, the most accurate methods were ERSA 2.0 and approaches that classify relationships using the IBD segments inferred by Refined IBD and IBDseq. Combining results from the most accurate methods provides little accuracy improvement, indicating that novel approaches for relatedness inference may be needed to achieve a sizeable jump in performance.
The recent explosive growth in sample sizes of genetic datasets has led to an increasing proportion of close relatives hidden within these large studies, necessitating relatedness detection. Inferring relatedness between samples is an essential step in performing genetic association studies and linkage analysis, is a powerful tool for forensic genetics, and is needed to account for or remove relatives in population genetic analyses. Relatedness estimation has also drawn the interest of the general public via companies such as 23andMe and AncestryDNA, which advertise their ability to find and report relatives, allowing individuals to explore their ancestry and genealogy.
The broad utility of relatedness estimation has motivated the development of numerous methods for such inference. These methods work by estimating the proportion of the genome shared identical by descent (IBD) between individuals, or a closely related quantity, where an allele in two or more individuals' genomes is said to be IBD if those individuals inherit it from a recent common ancestor. As previously shown, the distributions of IBD proportions for different relatedness classes (such as first cousins and half-first cousins) are expected to overlap, posing a challenge for these inference procedures.
Here, we present a rigorous evaluation of 11 state-of-the-art methods that can scale to large study sizes, including seven that directly infer genome-wide relatedness measures and four IBD segment detection methods that we utilized to infer these quantities. To assess each of these methods, we used SNP array genotypes from Mexican American individuals contained in large pedigrees from the San Antonio Mexican American Family Studies (SAMAFS). Our analysis sample included 2,485 individuals genotyped at 521,184 SNPs (Supplemental Note) within pedigrees that span up to six generations with genotype data
Correspondence: M.D.R., A.L.W. bioRxiv preprint first posted online Feb. 4, 2017; doi: http://dx.doi.org/10.1101/106013. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
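To make the idea of a relatedness degree and the two accuracy measures in the abstract concrete, the following is a minimal sketch, not the procedure used in the study. It assumes each pair is summarized by a kinship coefficient (half the expected IBD proportion, so about 0.25 for first-degree relatives) and bins it with power-of-two cutoffs, a commonly used convention. The function names `degree_from_kinship` and `accuracy`, and the example values, are hypothetical.

```python
def degree_from_kinship(phi, max_degree=6):
    """Map a kinship coefficient to a relatedness degree.

    Assumption for illustration: the expected kinship for degree d is
    2**-(d + 1), and the bin boundaries are placed at 2**-(d + 1.5).
    Returns 0 for duplicate/identical-twin pairs, 1..max_degree for
    relatives, and None when the pair is called unrelated.
    """
    for d in range(0, max_degree + 1):
        if phi > 2.0 ** -(d + 1.5):
            return d
    return None


def accuracy(true_degrees, inferred_degrees, tolerance=0):
    """Fraction of pairs whose inferred degree is within `tolerance`
    degrees of the true degree (tolerance=0 means an exact match)."""
    hits = 0
    for truth, guess in zip(true_degrees, inferred_degrees):
        if guess is not None and abs(truth - guess) <= tolerance:
            hits += 1
    return hits / len(true_degrees)


# Hypothetical kinship estimates for four pairs with known degrees 1 to 4.
kinships = [0.25, 0.13, 0.05, 0.02]
truth = [1, 2, 3, 4]
inferred = [degree_from_kinship(k) for k in kinships]

exact = accuracy(truth, inferred)                     # exact-degree accuracy
within_one = accuracy(truth, inferred, tolerance=1)   # within one degree
print(inferred, exact, within_one)
```

Because the IBD proportions of neighbouring relatedness classes overlap, any fixed cutoff of this kind will misassign some distant relatives by one degree, which is consistent with the gap between exact and within-one-degree accuracy reported above.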
Similar resources
Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives
Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a data set with 2485 indivi...
Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium.
Estimates of relatedness have several applications such as the identification of relatives or in identifying disease related genes through identity by descent (IBD) mapping. Here we present a new method for identifying IBD tracts among individuals from genome-wide single nucleotide polymorphisms data. We use a continuous time Markov model where the hidden states are the number of alleles shared...
Inference of Relationships in Population Data Using Identity-by-Descent and Identity-by-State
It is an assumption of large, population-based datasets that samples are annotated accurately whether they correspond to known relationships or unrelated individuals. These annotations are key for a broad range of genetics applications. While many methods are available to assess relatedness that involve estimates of identity-by-descent (IBD) and/or identity-by-state (IBS) allele-sharing proport...
Publication date: 2017